Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hardcoded sanity-max state size for case-insensitive matching #168

Merged
merged 1 commit into from
Oct 14, 2022

Conversation

unhammer
Copy link
Member

@unhammer unhammer commented Oct 13, 2022

As suggested in #167 (comment) , stop doing the case-insensitive matching when we've got a high number of State sequences.

Currently 65536, quite high but at least within what most modern machines can deal with.

Also, delete FSTProcessor.current_state since confusingly all the processors (except transliteration) make a local State called current_state

$ cat a.dix 
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
<alphabet/>
<sdefs>
  <sdef n="guess"       c="Guesser"/>
</sdefs>

<pardefs>
<pardef n="a-zA-Z+">
  <e><re>[a-zA-Z]+</re></e>
</pardef>
</pardefs>

<section id="regex" type="standard">
<e><par n="a-zA-Z+"/><p><l/><r><s n="guess"/></r></p></e>
</section>
</dictionary>

$ lt-comp lr a.dix a.bin
regex@standard 3 105

$ echo 'BADGERBADGERBADGERBAD' | \time lt-proc -w a.bin |head -c50 # BEFORE patch
^BADGERBADGERBADGERBAD/bAdGeRbAdGeRbAdGeRBAD<guessCommand terminated by signal 13
5.31user 1.08system 0:06.39elapsed 100%CPU (0avgtext+0avgdata 3815888maxresident)k
0inputs+0outputs (0major+1053716minor)pagefaults 0swaps

$ echo 'BADGERBADGERBADGERAD' | \time lttoolbox/lt-proc -w a.bin |head -c50 # AFTER patch
^BADGERBADGERBADGERAD/bAdGeRbAdGeRBADGERAD<guess>/Command terminated by signal 13
0.16user 0.03system 0:00.20elapsed 102%CPU (0avgtext+0avgdata 65248maxresident)k
0inputs+0outputs (0major+17537minor)pagefaults 0swaps

$ lt-comp rl a.dix g.bin
regex@standard 3 105

$ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lt-proc -g g.bin # BEFORE patch
BADGERBADGERBADGERBAD
3.21user 0.74system 0:03.96elapsed 99%CPU (0avgtext+0avgdata 2544316maxresident)k
0inputs+0outputs (0major+704011minor)pagefaults 0swaps

$ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lttoolbox/lt-proc -g g.bin # AFTER patch
BADGERBADGERBADGERBAD
0.02user 0.01system 0:00.02elapsed 118%CPU (0avgtext+0avgdata 8344maxresident)k
0inputs+0outputs (0major+2079minor)pagefaults 0swaps

# b.bin from issue #167 
$ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lt-proc -b b.bin # BEFORE patch
^BADGERBADGERBADGERBAD<guess>/BADGERBADGERBADGERBAD<guess>$
3.96user 0.91system 0:04.87elapsed 99%CPU (0avgtext+0avgdata 2544488maxresident)k
0inputs+0outputs (0major+704013minor)pagefaults 0swaps

$ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lttoolbox/lt-proc -b b.bin # AFTER patch
^BADGERBADGERBADGERBAD<guess>/BADGERBADGERBADGERBAD<guess>$
0.21user 0.02system 0:00.23elapsed 100%CPU (0avgtext+0avgdata 82504maxresident)k
0inputs+0outputs (0major+21817minor)pagefaults 0swaps

Tested both nob→dan (uses lt-proc -g) and nob→nno (uses lt-proc -g -b and lt-proc -p), no diffs on 40k lines of corpus, except nob→dan actually manages to get through its corpus now :)

@unhammer unhammer requested a review from mr-martian October 13, 2022 11:32
Copy link
Contributor

@mr-martian mr-martian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

currently 65536, quite high but at least within what most modern
machines can deal with

Also, delete FSTProcessor.current_state since confusingly all the
processors (except transliteration) make a local State called
current_state

Should help a bit against #167
@unhammer unhammer force-pushed the 167-max-state-size-sanity branch from ffda28d to 797829a Compare October 14, 2022 13:18
@unhammer unhammer merged commit 9b3132e into master Oct 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants